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Abstract 



Open Computing Language (OpenCL) is a parallel processing language that is ideally suited for running parallel algorithms on 



Graphical Processing Units (GPUs). In the present work we report on the development of a generic parallel single-GPU code for 
£h .the numerical solution of a system of first-order ordinary differential equations (ODEs) based on the OpenCL model. We have 
'applied the code in the case of the time-dependent Schrodinger equation of atomic hydrogen in a strong laser field and studied its 
, performance on NVIDIA and AMD GPUs against the serial performance on a CPU. We found excellent scalability and a significant 
^T) ^speed-up of the GPU over the CPU device which tended towards a value of about 40 with significant speedups expected against 
multi-core CPUs. 
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1. Introduction 
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Exploration of the fundamental processes that occur when 
atomic and molecular systems are subject to extreme conditions 
is currently a major research area. Theoretically, it is a huge 
^^task to treat the exact time-dependent (TD) response of a multi- 
^ (~| electron system subject to a strong electromagnetic (EM) field 
<"*> |by ab initio methods. In response to extensive experimental 
'achievements using high-intensity TkSapphire laser sources in 
the long wavelength regime, a very successfull approach that 
J> adopts the single-active-electron (SAE) approximation was ap- 
plied to the atomic and molecular cases QJJ] . For systems of 



O 



only two electrons, such as the negative hydrogen ion, helium 
and molecular hydrogen, direct ab-initio solutions of the time- 



dependent Schrodinger equation (TDSE) have appeared in the 
.early nineties (for a review see ref. J^). Since then, computa- 
tional performance has increased steadily and as a result these 
i methods have reached a high level of accuracy, efficiency and 
J" "reliability, tackling successfully the very demanding theoreti- 
• — h cal problem, of single and double ionization of helium at 390 
[X] and/or 780 nm Jill. 

^ . Recently, the realization of short-wavelength sources, through 
- - "the free-electron laser of high-order harmonic generation tech- 
niques, which deliver brilliant radiation in the soft- and (in the 
immediate future) hard X-ray regime have initiated new chal- 
lenges in the strong-fields atomic and molecular physics J5[0]. 
Theoretical and computational approaches to tackle these chal- 
lenging problems have been developed in atomic and molec- 
ular physics studies or are underway. Those include variants 
of time-dependent Hartree-Fock (TDHF) QHSEl as well as 
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extensions of the traditional time-independent R-matrix method 
to the time-domain such as, the R-matrix Floquet ap proa ch | Till . 
TDSE approaches fully based on R-matrix theory jl2l Il3l Il4ll 
and the rec ently developed R-matrix incorporated time (RMT) 
method lfl5lll6l 17 1. Thus it appears that there is a consistent in- 
terest in the development of computationally tractable methods 
able to treat multi-electron systems with the least approxima- 
tions possible. 

In the past two decades there have been several advances in 
various directions in the computational infrastructure. When a 
computationally demanding problem is being tackled the entire 
computing environment should cooperate in a coherent manner. 
This includes reliable and robust numerical libraries, sophisti- 
cated compilers, high speed networks, visualization software, 
technical support and training together with high processing 
rate and fast memory. The emergence of Central Processing 
Unit (CPU)-based parallel architectures allowed the develop- 
ment of High Performance Fortran, various parallel versions of 
C++ and the successful usage of Message Passing Interface 
(MPI) and Open Multi-Processing (OpenMP). As it is not the 
purpose of this work to elaborate on the available techniques 
for CPU-based computational paradigms we will focus on an 
alternative possibility of growing interest, namely the use of a 
heterogenous computational enviroment which involves the use 
of General Purpose Graphics Processing Units (GPGPUs) for 
an efficient and low-cost distributed hybrid computing system 

ill. 

The GPGPU programming model has appeared recently 
due to the availability of high-level compilers, through C-like 
languages such as CUDA and OpenCL C as well as PGI CUDA 
Fortran, where commands are addressed directly to the Graph- 
ics Processing Unit (GPU) CIS HI, FortranCL is an OpenCL 
fortran interface implementation which originated as an inter- 
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face within the quantum chemistry project for Octopus, a TDDFT 
package 122l 12311 . This allows OpenCL functions to be called 
from Fortran code. The major advantage of the architecture 
is the large number of, what are effectively, cores present on 
a GPU. Thus a powerful desktop computing environment ap- 
pears feasible, provided some current drawbacks are resolved 
such as the large disparity between double precision and single 
precision performance, possibly due to the lack of dedicated 
double precision arithmetic units, and the availability of high- 
performance library routines. As a result of the possibilities 
with the GPGPU platform, it represents a hot topic within com- 
putational physics. Two main computational platforms for GPU 
computing are currently in the mainstream interest; namely the 
CUDA and OpenCL frameworks. At the moment CUDA is of 
heavy use in a number of different scientific areas, but interest 
in OpenCL is increasing, with tools also available to convert 
CUDA code into OpenCL code such as the program Swanl24l. 
OpenCL is a language that was designed to suit the parallelism 
of GPUs. It is, in essence, very similar to CUDA but in terms 
of features within the framework there are some significant dif- 
ferences arising from CUDA being limited to a particular set of 
hardware from a particular manafacturer whilst OpenCL is not. 

The purpose of this work is two-fold. The first is to pursue 
the development of a theoretical method to tackle demanding 
computational problems in the area of complex quantum sys- 
tems under intense and ultrashort radiation fields. The second 
is to present a computational model which is only, we believe, 
in its starting phase, namely the development of a parallel com- 
putational model which does not discriminate between GPU 
and CPU architectures. In this sense, the present computational 
model is designed and is able to run on both CPU and GPU- 
based systems. To this end, the computational framework that 
we chose is based on the OpenCL language. Though the us- 
age of GPUs is already common within fields such as quantum 
chemistry 12511 and usage is flourishing in fields as varied as 
fluid dynamics and magnetohydrodynamics (see the introduc- 
tion from 12611 and citations within) and statistical physics 127 1 
with some usage appearing in fields such as in interacting many 
particle systems 12811 . 

Since CUDA is a more mature platform there exist routines 
which can optimize existing codes such as the use of an existing 
accelerated FFT routine such as the FFTW3 library used in Ref 
l29ll . It is worth noting that most implementations using GPUs 
are on the CUDA platform. Thus for this additional reason the 
present OpenCL implementation represents an important con- 
tribution in this newly emerged field. 

The paper is organized as follows. In Sec. |2] a basic de- 
scription of the OpenCL and GPU concepts and terminology 
is given. In Sec. [3] we formulate the physical problem un- 
der question in mathematical terms. In Sec. |4] we present the 
OpenCL implementation specific to the solution of a system of 
ODEs, while in Sec. [5] we apply our approach to the case of an 
hydrogen atom and present the benchmarking results. Finally 
we have relegated some the details about the hardware tested 
and the specific numerical algorithms employed in this study to 
the appendicies. 
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Figure 1: An AMD GPU Model based on the data from 



Appendix A the ALU, each pair of which are grouped into 
a single double precision processing element, are in red, the 
transcendental unit, in orange and marked with the T, is not 
used. Each grey box represents one compute unit, of which 
there are 18. The pool of registers which form private memory 
are shared amongst processing elements in a compute unit. The 
global cache caches global memory for use within the compute 
unit, it is not accessed explicitly. Also shown within the com- 
pute unit is the local memory which is accessed explicitly and 
is the medium for communication within a compute unit. The 
global and constant memory shown are accessed by processing 
elements within compute units. 



2. GPU programming and the OpenCL framework 

GPUs are a type of compute device in OpenCL terminol- 
ogy. GPU architectures have blocks of processing elements. 
Processing elements are similar to cores except for a few key 
differences since they are, or effectively are, arfhimetic logic 
units (ALUs). A processing element will have access to a cer- 
tain amount of memory which it exclusively accesses. This 
is known as private memory. Groups of processing elements 
execute in a SIMD fashion and may share a common mem- 
ory which can be treated as local memory as described later. 
These groups are known as compute units. GPU branch gran- 
ularity is coarse grained because of the SIMD design. GPUs 
typically have a slower clock speed (700-900MHz) in compari- 
son to CPUs. Figure[TJdemonstrates a typical configuration for 
an AMD GPU. The specific architectures provided by different 
vendors may vary, but the abstraction provided by OpenCL will 
hold. 

Global memory is available for access to all compute units. 
Constant memory is apart of global memory which is not changed 
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by processing elements. A cache for global memory may also 
be available. Global Memory is typically not integrated onto 
the same chip as the GPU. For a CPU, RAM typically uses 
DDR2 and the newer DDR3 whilst a GPU typically accesses 
GDDR5 global memory. For AMD hardware GDDR5 memory 
has twice the bandwidth of DDR3 memory l30ll . 

Communication between the GPU and CPU typically oc- 
curs over a PCI Express xl6 connection. For the V7800 this 
gives a theoretical maximum bandwidth of 8 GB/s while the 
peak realizable bandwidth is 6 GB/s. 

2.1. OpenCL 

OpenCL is a royalty free open standard. Initially developed 
by Apple® Inc, the standard is being actively developed and 
worked upon by the Khronos group, a large multi-vendor con- 
sortium which includes companies such as NVIDIA®, AMD®, 
IBM®, ARM®, Intel®, Texas Instruments®, Sony® and others. 
The current implementations of OpenCL from Intel, AMD and 
NVIDIA are based on the 1.0 and 1.1 standards. The 1.2 stan- 
dard was released on the 15 of November 201 1. 

OpenCL distinguishes between two types of code in any 
OpenCL program; the host code and the OpenCL C code. 

All code that is written in standard programming languages 
such as C or Fortran can be regarded as host code. A regular 
program with no connection to OpenCL can be viewed as con- 
taining entirely host code for example. The host code interacts 
with OpenCL purely through function calls to the OpenCL li- 
brary. This means that any compiler can be used to compile the 
host code as long as it can link against the OpenCL library. 

The OpenCL C code is written in an OpenCL variant of the 
latest ANSI/ISO standard of C known as C99. The major dif- 
ferences between OpenCL C and C99 are the restrictions placed 
on the language. A key restriction is the lack of recursion due to 
GPU hardware issues and also that two or more dimensional ar- 
rays must be treated as one dimensional arrays when being used 
as arguments for kernel functions. Although complex numbers 
are supported by the C99 standard they are not implemented 
in OpenCL C, instead, the user can create a complex structure 
containing two double precision elements, it is then a relatively 
simple matter to define the relevant series of complex multipli- 
cation functions. This, however, is an undesirable additional 
step. It is preferable if optimized implementations were used 
implicitly such as in C99. Other restrictions are listed in the 
OpenCL specification l3lll . The OpenCL C code is the code 
that will actually be performing a particular computation on a 
particular target such as a GPU or a CPU. 

Whilst CUDA is portable amongst most operating systems, 
OpenCL is portable in the greater sense of not being limited 
to specific hardware as well as operating systems. Support is 
not dependent on a single vendor. Possible compute devices in 
OpenCL are not just limited to GPUs and CPUs, they can also 
include FPGAs, DSPs, the IBM® Cell architecture and many 
more. 

OpenCL C code will execute on any architecture but, in 
practice, it will require a slight code modification or possibly 
a partial rewrite to achieve good performance from one archi- 
tecture to the next. 



OpenCL C code is compiled during the runtime of the host 
code. The OpenCL C code can be specifically targeted to a 
particular instance of a problem; some aspects are known only 
at runtime of the host code. This information can be used at 
compile time for the OpenCL C code and thus the code can be 
optimized for that particular instance. In this way, the compiler 
can take advantage of what is known at runtime of the host code. 

Memory objects such as one dimensional arrays can be cre- 
ated for use by the device code by function calls to the OpenCL 
library. A handle is returned to the object by the library which 
can then be used to refer to the object in future function calls. 

OpenCL as a framework provides for the execution of func- 
tions known as kernels, which are written in OpenCL C. A ker- 
nel is not directly called by the host code, instead, a call to a 
specific kernel with specific memory objects as arguments is 
placed in a queue on the host device when the clEnqueueN- 
DRangeKernel() function is called. The particular implementa- 
tion of the OpenCL standard will take care of all further details. 
For example, the implementation will decide when to pass a 
particular batch of function calls queued from the CPU to the 
hardware scheduler that is present on a GPU. The queue is said 
to be asynchronous. 

Since the objective is parallelism, the aim is for multiple in- 
stances of the same kernel to be simultaneously executed with 
independent data so as to spread the workload. The hardware is 
set to assign instances of this execution, known as work items, 
with particular identification numbers. Three sets of identi- 
fication numbers are given; the local, group and global IDs. 
OpenCL combines work items into work groups. The minimum 
size of a work group for an AMD GPU is 64 work items. This 
minium size is known as a wavefront in AMD terminology. For 
NVIDIA the minimum size is called a warp. AMD GPUs cur- 
rently execute a half wavefront at a time on a compute unit for 
double precision instructions. The local ID of a work item rep- 
resents its place within a work group. The purpose of the group 
ID is to represent a particular work groups position in relation 
to the other work groups. The global ID represents the position 
of a work item in relation to all other work items. 

For a specific work group size Ncroup the global ID is equiv- 
alent to: 

IDclobal = IDcroupNcroup + IDlocciI 

In this way a work item knows its place in the order of 
things in a manner similar to the concept of topologies in MPI. 
That is, there exists what can be thought of as a local topology 
between work items in a work group and a global topology be- 
tween work groups in the domain of the problem. The local and 
global topologies can be one, two or three dimensional in their 
layout. 

A work item can only communicate with other work items 
in the same work group. Unlike in MPI, the creation of virtual 
topologies is not built in by the framework though the equiva- 
lent can be implemented by an interested OpenCL C program- 
mer. A work item communicates with other work items in it's 
work group through the use of local memory; low latency mem- 
ory that may be dedicated to a particular compute unit or global 
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memory which is remapped to the work group. This communi- 
cation approach through memory is similar to that of OpenMP. 

There is a limitation on communication between work items 
in different work groups; they cannot communicate with each 
other during the execution of a kernel. This limitation is due to 
the execution of the kernel by the many work items, the coher- 
ence of global memory cannot be known since the order of ex- 
ecution of the work items is determined by the hardware sched- 
uler and not the programmer. 

When one queues kernel calls via an in-order execution queue 
it can be guaranteed that at the start of a function call all work 
items executing a particular kernel see a consistent view of 
memory and so it can be said that the work items have been 
synchronized. 

3. TDSE projected on the stationary system's eigenstate ba- 
sis 

Let us consider a multielectron (A^-electrons) atomic or molec- 
ular system described through the hamiltonian operator Hn(?n) 
with, the system's electron's positions ?n = (ri, Y2, ■•, !"#). We 
assume that the eigenstates of the hamiltonian, <t> r (rAi), have 
been calculated as the solutions of the Schrodinger equation: 



(H N (r N ) ■ 



E 7 ) O r (rjv) 



0, 



(1) 



where with the index y we represent all the quantum mechan- 
ical observables required to completely characterize the states. 
Let us now consider the TDSE of the above system subject to 
an external time-dependent field represent by the potential op- 
erator V(?n, t). The TDSE of the system in the presence of this 
external field is written as: 



i^Hrm t) = [H(r N ) + V(r N , f)\ i/r(r N , t), 



(2) 



supplemented with the initial condition ^(rjv, f o) = <Ao- Thus, 
the main goal is to calculate the time-dependent wave function 
of the system, given the hamiltonian of the unperturbed sys- 
tem and the external potential V(t). To this end, since any phys- 
ical state of a quantum mechanical system can be expanded in 
terms of it's eigenstates, the jV-electron wavefunction is writ- 
ten as: 



Cy(t),<&y(r N ), 



(3) 



where y, in principle, includes both the bound and continuum 
states of the spectrum. At this stage, towards developing a 
method of calculating the TD wavefunction, we consider the 
standard approach which assumes that the system is enclosed in 
a box. Having assumed this, the bound and the continuum solu- 
tions of the system can now both treated with a common index- 
ing representing a now discretized spectrum. Formally, projec- 
tion of the known discretized channel states <t> y onto the TDSE 
and subsequent integration over all spatial variables rjy will lead 



to the following set of coupled partial differential equation for 
the radial motion in channels y: 



ijCy{t) = EyCy(t) +Y j Vyy(t)Cy(t), 



(%\H N \O r ) 
<<D y |V(f)|O r <> 



(4) 

(5) 
(6) 



By properly ordering the coefficients C y {t) into a column vector 
C(f) and the diagonal (E 7 ) and non-diagonal (V r , r <(f)) elements 
into a square matrix H(f) we may rewrite the TDSE governing 
the system-field dynamics, in matrix representation, as, 



iC(t) = H(0C(0, 



(7) 



supplemented with the initial condition C(0) = Co. The latter 
set of equations for the coefficients, which represents a system 
of coupled ordinary differential equations (ODE), are subject 
to numerical solution. The ODE form of the TDSE, no matter 
which system we have, allows us to utilize our solution algo- 
rithm at a very general level. Within this eigenstate represen- 
tation of the system's TD wavefunction only two kinds of dy- 
namical quantities are required for the solution; the eigenener- 
gies and transition matrix elements between the system's eigen- 
states. The key point is that all the information about the exact 
nature of the system, whether multi-electron or not, whether 
atomic or molecular, is contained in the values of the eigenen- 
ergies and the matrix elements together with the required selec- 
tion rules for the transitions. It is for this reason, that any TDSE, 
atomic or molecular, can be formally re-expressed as a system 
of ODEs that the present computational algorithm is especially 
important, since it is designed to accept as input elements, ex- 
actly the matrix elements of H which uniquely define coupling 
between the system and field. 

3.1. Atomic/molecular system in linearly polarized radiation 
within the dipole approximation 

Since we are interested in atomic or molecular targets under 
EM fields one very important simplification can be achieved if 
we assume a linearly polarized radiation field along some axis, 
which without any loss of generality we may assume to be the 
z-axis in a Cartesian coordinate system. This assumption is 
used to determine the structure of the matrix involved in the 
TDSE system of equation (0. Now, let us make the channel in- 
dex y more specific; where y represents a solution of the hamil- 
tonian operator. The total angular momentum quantum number, 
given by L with it's component along the z-axis L z and the to- 
tal spin given by S and it's component along the z-axis S z . 
Thus, we write y as: y — (ELS MiM$). This is the so-called 
LS representation of the atomic states which is well suited to 
light atoms. Let us assume that the system starts from the state 
®o = \Ep, Lq, S q, , M s ) It is well established, through the 



Wignet-Eckhart theorem [32], that in the dipole approximation 
for the coupling of the radiation field the non-vanishing ele- 
ments are between these states with the same total spin and the 
same magnetic angular and spin quantum numbers. Thus we 
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Figure 2: The banded structure of the Hamiltonian is shown. 
This structure holds if we assume we have a linearly polarized 
field which interacts with an atomic or molecular target in the 
dipole approximation. The sub and super diagonal blocks con- 
tain N/xNi transition elements. The diagonal blocks are them- 
selves diagonal with the field-free eigenvalues for each L in the 
diagonal. 

have for the dipole transition matrix elements: 

D 7tY = (E y ,L 7 ,S y M Ly M Sj \D\E Y ,L Y ,S Y M Lr ,Ms 7> ) 

= Dy : y6L 7t L r , ± i6s r ,Sy(>M L ,ML y ^M s ,Ms y , (&) 

Therefore we get a structure for the matrix representation of 
H{t), which is based on very general terms, as shown in Fig- 
ure [2] In this figure we represent the case where the maxi- 
mum total angular quantum number that was considered was 
four (Ly„ rat = 4). The sub and super diagonal blocks contain the 
dipole transition elements between states having L y and L y + 1 . 
The diagonal blocks are themselves diagonal with the field-free 
eigenvalues for each L y . This banded structure is very general 
and it can be shown that this can also describe the interaction of 
radiation fields with molecular targets as well. For example in 
the case of diatomic molecules and in the fixed nuclei approx- 
imation the set of quantum operators required to describe the 
interaction with linearly polarized fields along the interatomic 
axis are the hamiltonian operator, the projections of the angu- 
lar quantum number along the interatomic axis (A) and the spin 
quantum number alongside its projection along the interatomic 
axis. In other words the y channel should be represented as 
y = \EK,S,S z ). 

4. The OpenCL GPU computational framework 

Ordinary differential equations of any order can be repre- 
sented as a system of coupled first-order differential equations. 
The algorithms described in the present section solve this generic 
problem for a system of N equations: 

C(r) = M(f)C(f), (9) 



where C(f) is the vector containing the N unknown complex 
coefficients and M(f) = -/H(f), where H(f) is a NxN symmetric 
matrix. At this point, it is worth noting that although the specific 
discussion is for the case of TDSE calculations, the framework 
can easily be applied to ordinary differential equations of the 
kind shown in Equation ([9j- 

Essentially the derivative is calculated through efficiently 
implementing the following matrix-vector multiplication: 

-i[E d +D(»]C 

Due to the neighbouring L y states being coupled (Figure 
|2j, we have a nearest-neighbour computational problem for the 
calculation of the matrix-vector operation. We can express the 
right hand side of Equation (|7]i in terms of eigenstate coeffi- 
cients for blocks of angular momenta Cz, r (f): 

H(f)C(f) = Yj [K CL r + D L r -hL y {t)C L ^ + £> Ly + Ur (f)C Ly + 1 ] , 

Ly 

where E d h is a diagonal matrix containing the field-free eigen- 
values of the eigenstates of the angular momenta block L y and 
Oz_-i,£, and T)i+\i are matrices that contain the dipole matrix 
elements that couple the states from the L y - 1 and L y + 1 eigen- 
states to the L y eigenstates respectively. The coupling terms are 
time dependent. 

Since we calculate the derivative by a matrix vector calcu- 
lation, the Taylor series and Runge-Kutta based methods are 
limited by the performance of this computation. 

For the Taylor series propagator the number of synchroniza- 
tion points is equal to the order of the problem. For a Runge- 
Kutta propagator, without specific optimizations such as those 
required for the Dormand-Prince algorithm, the number of syn- 
chronization points is equal to the number of stages plus one. 
The number of stages in a Runge-Kutta algorithm is greater 
than the order for methods with more than 4 stages. 

Having many synchronization points per order has two ma- 
jor negative effects. It increases the coding complexity since 
the calling of the different kernel functions has to be accounted 
for and it also decreases the ability for optimizations to be im- 
plemented since breaks in the execution stiffle latency hiding 
attempts. 

With the explicit Runge-Kutta methods three distinct ker- 
nels are required; 

1 . A kernel to perform the vector sum before the derivative 
calculation as shown in Equation lB.4l 

2. A kernel to perform the derivative calculation in Equation 

EH 

3. A final kernel to sum all the derivatives in Equation lB.3l 

The Taylor series, on the other hand, requires only one ker- 
nel which performs both the derivative calculation and the ad- 
dition to the solution. Since no optimizations have been imple- 
mented at present it is expected that both methods should have 
the same execution time if the same number of derivatives are 
being calculated by performing matrix vector calculations. 
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4.1. Algorithm for Splitting up Generic Work 

An algorithm is necessary that splits up a workload of Nwork 
units into a specific number of units given by N worker- A worker 
can be a work item or a work group. We get the start position 
for a specific worker: 

Remainder <— Nwork (mod Nworker) 
Div lN Work „ 

S tart <— Div * IDworker 

if ID-worker - Remainder then 

S tart <— S tart + IDworker 

else 

S tart <— S tart + Remainder 
end if 

Now a check is performed to make sure a worker has not 
been assigned a value that is out of range of the available work. 
If this over-assignment has occured the amount of work is ad- 
justed for the worker. 

if ID-worker < Remainder then 

; <- 1 
else 

i <-0 
end if 

Div <— MIN(N Wo rk - S tart, Div + i) 

End — S tart + WorkPerWorker 
This stops workers accessing unallocated memory and also stops 
workers from performing duplicate calculations. If an exact 
number of workers is chosen so that the division is assured to 
be correct then this is unnecessary. 

Splitting up Work Groups Amongst Blocks of Work. A block of 
work is treated as a series of tasks that involve identical instruc- 
tions being executed but with different data. Due to occupancy 
issues it may not be ideal to assign a full block of work to one 
work group. The following algorithm was created to perform 
this split up calculation: 

if Ncnmp S < Nwork B then 



ID C 
N, 



roups 
Group B 







1 



Call main algorithm in section I4T 
else 



N, 



Groups 



Groups I Work 



(mod Ncroup B ) 



IDf} roll p B * ID(} r0U p 
S tart <- ID C roup/N arml „ B 
End <- MIN {Start + l,N WorkB ) 
end if 

Where Noroups is the number of work groups available, IDcroup 
is the ID of a particular work group and Nwork B is the amount 
of blocks of work to split up. After the algorithm has finished 
each work group will be associated with a particular block of 
work which is denoted by an ID IDo roupB . The number of work 
groups in the block is given by Ncroup B ■ 

The call to the main algorithm is done where N G roups be- 



comes Nworker, ID, 



Group 



becomes IDworker and Nwork is still 



used as the amount of work. 



Splitting up a Block to Work Items. For dividing up a block of 
work which we gave a Block Id of IDgwup b to, amongst work 
items in a few work groups Noroup B > firstly we must split the 
work available between the work groups, so the following is 
done: 

TotalWork <- N Grollp * N GroupB 

IDcroupe <— IDlocal + Ncroup * IDjilock 

Call main algorithm in section |4~T1 
Here we assign a particular ID ID GroupB to each work group 
within the Block of work with ID IDsiock- Now we can call the 
main algorithm to divide the section of the block assigned to 
each work group to the individual work items. 

5. Application to the case of atomic hydrogen in a laser field 

In this case the y channel index is replaced from the en- 
ergy, the angular momentum and its component along the field 
polarization axis. We ignore the spin operators and thus we 
have for the eigenstates of atomic hydrogen y = (elmi), where 
e the energy eigenvalue and I the angular momentum quan- 
tum number. The continuum is discretized and together with 
the discrete bound states 113 311 . Then, the eigenstate basis of 
the field free hydrogen hamiltonian is the partial wave basis 
O y (r) = <r | y) = \P ey i y {r)Yi ymh (f) where Y bni (r) are the spher- 
ical harmonics. 



V(r,t) = Y J Cy(W r (r). 



The eigenstates of atomic hydrogen are coupled to each other 
by a strong pulsed laser. This laser field couples states that 
differ in angular momentum by 1 unit while the magnetic quan- 
tum number was set to zero (see Figure[2]). We set the magnetic 
quantum number to zero because we assume that the initial state 
was the ground state of hydrogen where / = and therefore 
mi — and since mi is constant we do not need to take it into 
account. There are 649 eigenstates associated with each angu- 
lar momentum L in the basis that is used for the computations in 
this paper. Nine of these eigenstates are boundary states of the 
B-Spline basis used which are fixed at 0; a total of 640 states 
then are explicitly represented in the calculations for each an- 
gular momentum. The population of the continua represent the 
level of ionization after the laser field has passed. 

The EM field was modelled by a sine squared pulse, linearly 
polarized along the z-axis: 



E(f) = zE sin z (—t)sin(tot) 
2n 



(10) 



where co is the photon frequency and n is the number of cy- 
cles per pulse. The propagation was performed in the veloc- 
ity gauge where the dipole operator is expressed as D v = -p ■ 
A(f)/c. For the velocity gauge a five point gaussian quadra- 
ture integrates the E field to give the vector potential A(f) = 
- J f rff'E(f') at each time step. The integration was found to 
perform as expected by comparing it to an analytical expression 
for a sine squared pulse with a particular photon frequency. The 
method works equally well for length gauge calculations where 
the electric field E(t) gives the time dependence. 
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The present GPU implementation of the Taylor and Runge- 
Kutta propagators (see appendix) was used for calculations in 
the case of atomic hydrogen. The accuracy and precision of the 
propagators was verified by comparing the photoelectron spec- 
trum (PES) of the system to a known working propagator. The 
propagator is based on a NAG Runge-Kutta based solver. Since 
above threshold ionization (ATI) has occurred the photoelectron 
spectrum is distinct. 

In terms of the particular OpenCL implementation the split- 
ting of a block to work groups was made where the number of 
blocks of work Nwork B mentioned in section |4~T1 corresponds to 
the number of angular momenta L lo , when we are using the ba- 
sis representation. IDq TOUPb is the ID for a particular angular 
momenta L. 

For division of the work initially the algorithm in section l4~Tl 
is called. This will assign Ncroup„ work groups to each angu- 
lar momentum block of coefficients Cl- Following this, a call is 
necessary to divide the individual coefficents in Cl amongst dif- 
ferent work items. This is done through a call to the algorithm 
defined in section |4~T1 A choice of number of work groups and 
work group sizes was made such that for every coefficient there 
would be one corresponding work item. 

In the benchmarks shown the matrix was treated as a very 
large one-dimensional array. Each diagonal matrix block E® 
was passed followed by the related superdiagonal dipole ele- 
ment block Di in row major form. In this form the superdiag- 
onal blocks were transferred to the GPU but the subdiagonal 
blocks were not represented. Since the matrix is Hamiltonian 
the subdiagonal block is not necessary. An implementation was 
also made where both subdiagonal and superdiagonal blocks 
were present although the runtime was longer. 

5.7. Benchmarking Results 

Since OpenCL allows for both GPU and CPU execution we 
have benchmarked GPU execution on an AMD FirePro v7800, 
a single GPU on a NVIDIA Tesla SI 070 Computing System 
node and a dual core Intel Xeon. 

In comparing the Taylor propagator a specific step size and 
order was chosen so that for every computation it can be guar- 
anteed that the propagator will maintain unitarity. The step size 
chosen does not represent the optimal choice and so should not 
be used in comparison to other methods. What is of interest is 
how the method scales as the work size is linearly increased. 
Since the method is nearest neighbour the computational over- 
heard for the simulation also rises linearly. Any deviations from 
this linearity would be due to the limitations in the hardware or 
algorithms used. 

By chosing to benchmark along the number of angular mo- 
menta in the basis set, the computational cost of the problem 
can be increased linearly by increasing the number of angular 
momenta. The computational cost is linear as the matrix vector 
calculation is, in this particular application, a nearest neighbour 
problem. The size of the Hamiltonian in terms of number of 
double precision elements is (TV/ + Nf)l where TV/ is the num- 
ber of pairs of double precision elements required to represent 
the coefficients for each angular momenta. For example, for 




Electron Kinetic Energy (eV) 



Figure 3: Shown is a sample PES of simulations with 20 eV 
photons with a pulse of intensity 1 x 10 l4 Wcm~ 2 of the three 
propagators that were compared. A high agreement is seen be- 
tween the classic RK4 and the Taylor propagators. The Runge- 
Kutta-Fehlberg is markedly different from the other two meth- 
ods which mostly overlap. 



the case of TV; = 640 and I = 0, ...,20 then the hamiltonian 
has 8615040 double precision values which require an array of 
about 65 MB. 

An approximate comparison between the Taylor propaga- 
tor and the Runge-Kutta method was made by comparing a 
10th order Taylor propagator to a classic 4th order Runge-Kutta 
propagator and the 4th (5th) order Embedded pair Runge-Kutta- 
Felhberg (RKF) method. The 5th order solution was chosen 
from the RKF method. It has been noted in the literature that 
high order Taylor propagators with large step sizes perform bet- 
ter than lower order Taylor propagators with smaller time steps 
1 3411 . As a result of this a 10th order Taylor propagator was 
chosen. So we can compare like with like the step size of the 
Taylor propagator was altered so that an equal number of ma- 
trix vector calculations would be performed. Similarly, the 5th 
order solution from the RKF method was also performed with 
an adjusted time step. 

As can be seen from Figure [4] there was no major discrep- 
ancy in the runtime of the Taylor and Classic RK4 propagators 
but there was a major discrepancy with the RKF 4(5) propa- 
gator which we attribute to it being an embedded pair method. 
The RKF 4(5) propagator, of which we use the 5th order solu- 
tion, also consistently deviates in the photoelectron spectrum, 
an example of which is seen in Figure[3] All three methods did 
contain the expected structure within the PES, but the RKF 4 
(5) method deviates in the expected intensities. By comparing 
the unitarity of the solution of the methods a lower bound on 
the error can be obtained. The Taylor propagator kept unitarity 
to a level of 1.9 x 10" 14 , while the classic RK4 method kept 
unitarity to 2.3 x 10~ n and the RKF 4(5) method deviated from 
unitarity by 2.5 x 10~". Step sizes for each method were cho- 
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Classic RK4 
Taylor (10th Order) 
RKF4{5) 




16400 21400 
Total Number of Equations 



16400 21400 
Total Number of Equations 



Figure 4: A graph of the performance of a 10th order Taylor, 4th 
order Classic RK4 and 5th Order Runge-Kutta-Fehlberg propa- 
gators with time steps of 6.25 x 1CT 3 , 2.5 x 1(T 3 and 3.75 x 1(T 3 
respectively. The time steps were chosen so that the same num- 
ber of matrix vector calculations would be performed in all 
cases. Since the Fehlberg method is a multi-order method it 
requires slightly more computations and so does not scale as 
well as step-size control is not implemented. 

sen so that an identical number of matrix vector calculations 
would be performed. The performance figures should not be 
used to decide on the choice of method, rather it is used here 
to demonstrate that the number of matrix vector calculations, 
which corresponds to the number of kernels queued, appears 
to be the primary factor for deciding the runtime speed of the 
algorithms. 

It can be seen from Figure|5]that for the NVIDIA GPU, 192 
work items gives the greatest reduction in runtime whilst for the 
AMD GPU, 64 work items gives the best performance in these 
circumstances. The AMD GPU has a highly linear increase in 
run time as expected from a consistent use of the computational 
resources. 

When the GPU results are compared to the serial CPU re- 
sults a clear trend is seen (Figure [SJ. The GPU based simula- 
tions using OpenCL scale better than the CPUs; The runtime 
within the region shown in the figure for the particular pulse 
described, where x is the number of double precision elements 
in the vector of coefficients, is: 

tlNTEL(x) = 0.14* -170 
tAMD(x) = 0.0032* + 14 
tNVIDIA(x) = 0.010x + 9.3 

The CPU timing must break down for smaller situations but 
this is unimportant since the number of explicit states is 640 
this means the smallest possible vector of coefficients is 2560 
elements. 

Figure [7] indicates the general trend which can be extrap- 
olated from the above equations: the speedup for the AMD 
device tends towards a 40 times speedup whilst the NVIDIA 
device tends towards a 14 times speedup as the problem size 



Figure 5: A graph of the performance of an AMD v7800 in 
comparison to one GPU compute device in the Tesla S1070 for 
a tenth order Taylor propagator. Shown is the effects of sev- 
eral different configurations of work items in each work group; 
the work group size is shown in brackets. The optimal num- 
ber of work items per work group is architecture dependent. 64 
was optimal for the AMD GPU but for the NVIDIA GPU 192 
work items per work group was optimal. A step size of 0.005 
was chosen. The number of equations indicates the number of 
real equations, that is every complex equation consists of 2 real 
equations. 

4500 t 

4000 - 
3500 - 
3000 - 

o" 2500- ^tr^ 
| 2000 - 

1500 " -*-S1070(192) 

-*-Xeon (Serial) 

1000- *-V7800(64) 

500- 

6400 11400 16400 21400 26400 

Total Number of Equations 



Figure 6: The runtime in seconds of the best performing 
NVIDIA and AMD configurations from Figure [5] with an In- 
tel Xeon. 

increases. If, in what is most likely an overly optimistic sce- 
nario, one took the CPU scaling to be linear with the number 
of cores this would still provide a speedup of the GPU of an or- 
der of magnitude in comparison to a multi-core system which, 
incidentally, would be more expensive to purchase. With the 
V7800 card the relationship would terminate at a matrix which 
can fit into an array of size 256 MB because the largest single 
block of memory allocatable on the device is 256 MB. 
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13960 18960 23960 

Total Number of Equations 



the matrix vector operations represent the most significant com- 
putational bottleneck in the method. 

6. Conclusions 

A vast number of problems can be formulated in terms of 
a system of first-order ODEs. For propagators in which matrix 
vector calculations represent a significant bottleneck, signifi- 
cant runtime reductions can be achieved by the use of GPUs 
through the OpenCL language. A number of strategies for op- 
timization exist in OpenCL which we discussed briefly for our 
particular case. Optimizing the existing code will require fur- 
ther work for an expected further order of magnitude improve- 
ment in runtime scalability. It also goes without saying that, 
with improvements in compilers and hardware, future trends 
should be for fine-tuning optimizations to be performed by so- 
phisticated compilers and hidden behind generic functions. 



Figure 7: AMD V7800 and NVIDIA T10 GPU speedups in 
comparison to the runtime on a single core of an Intel Xeon 
CPU. The speedup factor here is the ratio of runtimes jgj^-. 



5.2. Optimization approaches 

Although no optimization has been performed, this repre- 
sents a future line of work when the need for further runtime 
reductions becomes an issue. Aligning memory accesses is the 
first step in any optimization. It makes certain that memory 
accesses occur on word aligned boundaries. This is achieved 
by ensuring that each row of the Hamiltonian and each angular 
momenta block of the vector C(?) are aligned. 

If a large cache is available on the GPU then prefetching can 
be enhanced by explicit cache functions available in OpenCL. 
Alternatively compute unit memory can be utilized. Compute 
unit memory accesses have a much lower latency than global 
memory access by approximately an order of magnitude. Using 
compute unit memory also means that there will be less demand 
on the memory controllers. 

On AMD for example, multiple accesses to a specific mem- 
ory controller are serialized when there is a conflict. Atomic 
operations should be avoided if possible on the AMD architec- 
ture as a single atomic operation can dramatically reduce all 
other memory operations. Unrolling loops can also help the 
compiler take advantage of the memory access structure but it 
also increases register use l35ll . 

Another direction to improve the present implementation 
is the use of more sophisticated propagation algorithms than 
the Taylor and the Runge-Kutta. An important candidate, wor- 
thy of consideration, should be the Arnoldi/Lanczos algorithm. 
Whilst other theoretical studies have remarked that the Taylor 
propagator is both simple and reliable ll34ll they argue that it 
is slower than the Lanczos propagator, mainly due to smaller 
timesteps which the Taylor propagator requires 

HEl- An 

optimized Taylor propagator should lend itself towards the con- 
struction of the Krylov subspace for a Lanczos propagator since 
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Appendix A. OpenCL compute devices tested 

AMD FirePro V7800. The AMD FirePro V7800 is a PCI-e 
xl6 connected graphics card 13711 with 288 processing cores. 
Each core consists of four arithmetic logic units (ALU) and a 
transcendental unit which are fed instructions through a Very 
Long Instruction Word (VLIW). The ALUs can be thought in 
OpenCL terms as a processing element. For double precision 
the transcendental unit is not used and the remaining four are 
grouped into two double precision execution units, thus there 
are two double precision processing elements per processing 
core. This means that for practical purposes 576 double preci- 
sion instructions can be executed simultaneously. For floating 
point calculations 1152 instructions can be executed. The pro- 
cessing cores are grouped into compute units. Obviously the 
actual number of instructions executed in a cycle is dependent 
on the form of the workload. A compute unit (a SIMD proces- 
sor) consists of 16 of the processing cores; as a result there are 
18 compute units. 1 Gigabyte of global memory is available as 
well as 32KB of memory per compute unit. Each processing 
element has access to a pool of registers (256KB per compute 
unit). Global memory is accessed with GDDR5. The core clock 
is 700 MHz. 

NVIDIA Tesla S1070 Computing Systems. The NVIDIA Tesla 
S1070 computing system consists of multiple Tesla T10 GPUs 
which are based on the GeForceGTX 200 GPU Q. Each GPU 
contains 240 scalar processing cores and 4GB of memory [ T3S 
Currently a single GPU is targeted. 
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Intel Xeon. The Intel® Xeon® W3503 |40[] used is a 64 bit 
dual core CPU with a clock speed of 2.4 GHz, a 4 MB cache 
and with support for DDR3 memory with a 25.6 GB/s memory 
bandwidth. 

Appendix B. The Taylor and Runge-Kutta time propaga- 
tors 

The Truncated Taylor Series Propagator. Within this single- 
step algorithm the forwarded solution is obtained as: 



c(t+dt) = Y ^c ( "\t) 



(B.l) 



where C (/,) (f) is the n-th derivative of C(f) at time t. A recursive 
expression for the required n-th derivatives of the coefficient 
vector can be retrieved by successive integrations of Equation 
© as: 



C w (f) 



-H(/)C w (f) 



(B.2) 



where the zero derivative is equal to C(t). To arrive at this 
expression one shall assume the time derivative of the Hamil- 
tonian itself, within the forwarded time interval dt, is much 
smaller than the rate of change of the coefficients. Particularly 
for the present problem, this latter assumption is an excellent 
approximation, provided that the chosen time step dt is much 
smaller the field's period Iji/oj, in other words, dt « 2n/a>. It 
can be shown that this expression does indeed give an approxi- 
mate form of the unitary operator. In practical calculations, the 
above expression consists of a Taylor series truncated to some 
order N, which in combination with the time step dt sets the 
order of accuracy of the solution. Finally, from this expression, 
after calculation of the derivatives of the coefficient at a known 
time t, they are combined together in a summation in order to 
then calculate the wavefunction at a later time t + dt according 
to Equation lB.il The calculation for each step consists of N 
matrix-vector multiplications and N vector additions onto the 
solution of the system a step later. 

Explicit Runge-Kutta Methods. Runge-Kutta methods have been 
some of the most widely used methods for solving ordinary dif- 
ferential Equations l4ltl . Runge-Kutta methods use information 
from several steps to approximate a Taylor expansion 11421 pg 
906]. The class of explicit Runge-Kutta methods is expressed 
in the form [43]: 



C(t + dt) = CiO + dtJ^biC 
C (0 = 



i=l 



t + Cidt, C(t) + dt V atjC 



(J) 



(B.3) 
(B.4) 



where f () represents the derivative of C. 

A number of Runge-Kutta methods exist such as the classic 
fourth order Runge-Kutta method and the 4th(5th) order em- 
bedded pair Fehlberg method [44]. In the Runge-Kutta-Fehlberg 



method 4th and 5th order steps are calculated using the same 
derivative calculation information; the difference between the 
two methods gives an indication of the local error size. 
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